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Abstract 

A self-organizing approach is proposed for gene finding based on the model of codon usage for coding 
regions and positional preference for noncoding regions. The symmetry between the direct and reverse 
coding regions is adopted for reducing the number of parameters. Without requiring prior training, 
parameters are estimated by iteration. By employing the window sliding technique and likelihood ratio, 
a very accurate segmentation is obtained. 

PACS number(s): 87.15. Cc, 87.14.Gg, 87.10.+e 

The data of raw DNA sequences is increasing at a phenomenal pace, providing a rich source of data 
to study. As a consequence, we now face the tremendous challenge of extracting information from the 
formidable volume of DNA sequence data. Computational methods for reliably detecting protein-coding 
regions are becoming more and more important. 

Genome annotation by statistical methods is based on various statistical models of genomic sequences 
10, 1^, one of the most popular being the inhomogeneous, three-period Markov chain model for protein-coding 
regions with an ordinary Markov model for noncoding regions. The independent random chain model can 
be included in this category by regarding it as a Markov chain of order 0. The codon usage model is the 
independent random chain model of non-overlapping triplets, and corresponds to an inhomogeneous Markov 
model of order 2. Signals in a short segment are usually buried in large fluctuations. With well chosen 
parameters statistical models work as a noise filter to pick out the signals. 

Methods based on local inhomogeneity, e.g. position asymmetry or periodicity of period 3, suffer fluc- 
tuations. Most of the current computer methods for locating genes require some prior knowledge of the 
sequence's statistical properties such as the codon usage or positional preference B ^, |^. That is, a sizable 
training set is necessary for estimating good parameters of the model in use ||, 0. Strongly biased by the 
training, such models have little power to discover surprising or atypical features. Thus, it is desirable to 
decipher the genomic information in an objective way. Audic and Claverie jsj have proposed a method which 
does not require learning of species-specific features from an arbitrary training set for predicting protein- 
coding regions. They use an ab initio iterative Markov modeling procedure to automatically partition genome 
sequences into direct coding, reverse coding, and noncoding segments. This is an expectation-maximization 
(EM) algorithm, which is useful in modeling with hidden variables, and is performed in two steps of ex- 
pectation and maximization ^ |l^. Such a self-organizing or adaptive approach uses all the available 
unannotated genomic data for its calibration. 

Before introducing the model we use and describing the technical details, we explain the EM algorithm 
with a simple pedagogic model which assumes that a DNA sequence written in four letters {a, c, t} is 
generated by independent tosses of two four-sided dice. An annotation maps the DNA sequence site-to-site 
to a two-letter sequence of the alphabet {C, N} (C for coding and N for noncoding). Two sets {pa,Pc,Pg,Pt} 
and {qa, 9c, Qg, <lt} of positional nucleotide probabilities are associated with the two dice C and N, respectively. 
The total probability for the given DNA sequence S = SiS2 ■ ■ ■ to be seen under the model is the partition 
or likelihood function 
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where the summation is over all the possible "annotations" H = {Ha} with = h'^h'^ ■ ■ ■, hf E {N,C}, 
and P{s\C) = Ps, P{s\N) = g^. The unknown two sets of probabilities can be determined by maximizing 
the likelihood Z. Prom Bayesian statistics 

with prior P{Ha) assumed, the most possible Ha can then be selected as the inferred annotation. As we 
know, coding regions are organized in blocks. The first simplification is the window coarse-graining. The 
sequence S is divided into nonoverlapping window segments of constant length w, and each whole window 
is entirely assigned to either N or C. Conducting Bayesian analysis for window Wj and accepting uniform 
prior, we have P(h\Wj) oc P{Wj\h). The second simplification is to introduce "temperature" r (as in the 
simulated annealing), replace P{Wj\hj) with [P(Tyj and take the limit r ^ 0. In this way we keep 
only a single term, i.e. the greatest one, in the summation for Z. Window W is inferred to belong to either 
N or C depending on whether P(W\N) or P(W\C) is larger. The likelihood maximization is then equivalent 
to estimating nucleotide probabilities with frequencies in two window classes inferred from the pre-assumed 
{ps} and {Qs}- Consistency requires that the estimate probabihties must be equal to {ps} and {qs}- This 
"fixed point" can be found by iteration. As an example, we use the first 99 x 5 051 = 500 049 nucleotides of 
the complete genome of E. coli as the input data. Statistical significance requires that the window size cannot 
be too small, while a large window size would give poor resolution in discriminating different regions. The 
window size is chosen to be w = 99. We assign the 5 051 fixed nonoverlapping windows to the two subsets of 
TV and C in either a periodic or a random way. We estimate ps and Qs from the counts of different nucleotides 
in each subset. The likelihood functions for each window are then calculated using the estimated ps and qs, 
and the assignment of the windows to C or iV is updated according to which of P{W\C) and P(W\N) is 
larger. This ends one iteration. The process of iteration converges to a single fixed point of precision 10~* 
around step 28 for different initializations with a final window assignment to N and C also given. The final 
Ps and qs are {0.219, 0.270, 0.289, 0.222} and {0.279, 0.213, 0.227, 0.281}. The qs estimated from the complete 
genome are {0.285,0.214,0.218,0.283}, which are rather close to the corresponding convergent values. 

More realistic models take the three phases in the coding regions and the opposite ordering of the direct 
and reverse coding regions into account. Such models adopt 7 subsets: one for noncoding (N), three for direct 
coding (Ci, C2, C3) and three for reverse coding (C4, C5, Cq). The subscript i in C, indicates the phase 0, 1 
or 2 accordng to i (mod 3). From the genomic data statistics, we may assume that there is symmetry between 
the direct and reverse coding regions, which means that a reverse coding sequence is indistinguishable from 
a direct coding sequence if we make the exchanges a ^ t, c ^ g and reverse the order. For the model 
based on the positional preference of codons, instead of 7 sets of positional nucleotide probabilities, we need 
only 4 sets. The reduction of the total number of parameters by the symmetry consideration improves the 
statistics. The procedure of iteration is similar to that for the last model, the only difference being that now 
we have to estimate 4 sets of probabilities and calculate 7 likelihood functions for a window. 

We use a better model based on the codon usage. We now need a set of 64 probabilities for coding 
regions. For noncoding segments, 4 positional nucleotide probabilities are used just as before. To simplify 
the programming, we move the windows with a phase-shift other than zero by one or two nucleotides to clear 
the phase-shift, although we can calculate the marginal distribution probabilities for uni- and bi-nucleotides. 
For example, we replace the window W = SjSi+i . . . Si+w-i marked as C2 with W = • • • Si+w (The 

alternative way is to consider a cyclic transformation.) Our further discussions are all based on this model. 
It is observed that the iteration also quickly converges to a fixed point. Contrary to the two-sets model 
where coding and noncoding are symmetric, and extra knowledge is required to relate one set to coding and 
the other to noncoding, we can now distinctly distinguish coding from noncoding regions, even with their 
phases fixed. Direct and reverse coding sets are symmetric in the model. However, the fact that stop codons 
tea, tag and tga are rare can be used to remove the symmetry between direct and reverse coding. That is, if 
the convergent probabilities for taa, tag and ^170 are all significantly small in comparison with the other 61, 
sets Ci, C2 and C3 then do indeed correspond to direct coding. (Otherwise, those of tta, eta and tea would 
be small instead.) 

We employ the sliding window technique to improve the resolution as follows. We shift each window 
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by 3 nucleotides, initiate the window assignment with the convergent probabihties just obtained, and then 
find new assignments for the shifted windows by iteration. We repeat the shifting process 32 times to cover 
the window width. This ends with 33 assignments for triplets, except for a few sites at the two ends. By a 
majority vote we can obtain a triplet assignment of the whole sequence. 

Recently, an entropic segmentation method that uses the Jensen-Shannon measure for sequences of a 
12-letter alphabet has been proposed to find borders between coding and noncoding regions Their best 
result was obtained on the genome of the bacterium Rickettsia prowazekii. We test our approach with the 
same genome data. To inspect the accuracy we obtain the "true" assignment of sites based on the known 
annotation as follows. If a nucleotide is in a noncoding region, it belongs to N. If it is in a coding (or 
reverse coding) region and the site-index of the beginning nucleotide plus 1 is congruent to i modulo 3, the 
nucleotide under consideration will belong to Ci+i (or d+i). For overlapping coding zones we may keep 
two alternative assignments. We define three rates of accuracy i?2, ^3 and R7: R2 only discriminates coding 
from noncoding segments while i?7 covers full discrimination of the 7 sets, and i?3 ignores the phases. For 
the total N = 1111523 nucleotides, we obtain R2 = 91.7%, i?3 = 89.8% and i?7 = 89.7%. (The rates 
without window sHding are R2 = 89.1%, E3 = 84.8% and i?7 = 84.4%.) 

For finding block borders, to eliminate illusary fiuctuations we accept only the assignments with the 33 
identical samplings, and regard others as undetermined. When two adjacent identified blocks are of the 
same assignment we join the two together with the sites between into a single zone of the same assignment. 
Otherwise, we take the middle site of the intervening undetermined zone as the border, and assign the two 
sides according to their corresponding flank blocks. We can do the job better by means of the likefihood 
ratio. Suppose that the left block is assigned to and the right to r. A point m in the intervening zone 
divides the zone into two segments and Rm- The likelihood ratio is defined as 



Fm — 
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The maximal F,„ places the border at m. This segmentation finally gives the accuracy rates R2 = 93.3%, 
i?3 = 92.8% and i?7 = 92.7%. 

In Ref. jl^ the quantity quantifying the coincidence between borders inferred from the segmentation and 
those from the known annotation is defined by 
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where {bi} is the set of all borders between coding and noncoding regions, and {cj} is the set of all cuts 
produced by the segmentation. We use an even harsher quantity D by interpreting {bi} and {cj} as the 
borders of all coding zones. That is, we include borders of each overlapping coding zone. The total number 
of "CDS" in the annotation is 834, one of which has two joint zones. We obtain 1 — D ^ 87.7%, compared 
with ^ 80% of Ref. In Fig. 1 we show a comparison of the inferred segmantation with the known 

coding regions. In the section from 475 500 to 497 500 there are two overlaps (one for direct, and the other 
for reverse coding regions), and the shortest gap separating adjacent coding regions is just 1 nucleotide (at 
486 215). They do not escape detection. As mentioned in there are two very close coding regions in 
the same phase (538197 : 539 879 and 539 937 : 540 887). The result from the majority vote is shown in 
Fig. 2 for the section. We see indeed a peak of the counts for set N between the two coding regions. The 
highest count for N is 32, and so is ignored in our strategy. There is indeed plenty of room for improving 
this approach. A larger width w — 123 gives higher accuracy rates: i?2 = 93.6%, R3 — 93.3%, R7 — 93.0% 
and 1 — D = 88.2%. When we consider only the triplets with all 33 assignments identical in window sliding 
the rates are R2 = 98.7%, R3 = 98.6% and Rj ~ 98.6%. In the above we avoid setting up an arbitary 
cut-off threshold. If a threshold of 17 counts is used to determine the segments whose central parts have 
33 identical samplings, for w = 99 we predict a total of 1 001 351 (90.1%) sites with accuracies R2 = 97.4% 
and i?3 = 95.4%. The accuracy rate for noncoding regions is 96.5%, much higher than that of Ref. 1^. It 
is important and feasible to integrate biological signals into our algorithm. We expect our algorithm, with 
certain modifications, should work well for other species, too. 
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Figure 1: Comparison between the inferred segmentation (dotted lines) and the known coding regions of 
Rickettsia (shaded areas). 



Figure 2: Counts of majority assignment in the section containing two very close coding regions (shaded 
areas) in the same phase. A peak corresponding to noncoding assignment is clearly seen. 
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